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Multi-task additive models with shared transfer 
functions based on dictionary learning 

Alhussein Fawzi, Mathieu Sinn, and Pascal Frossard 


Abstract 

Additive models form a widely popular class of regression models which represent the relation between covariates 
and response variables as the sum of low-dimensional transfer functions. Besides flexibility and accuracy, a key benefit 
of these models is their interpretability: the transfer functions provide visual means for inspecting the models and 
identifying domain-specific relations between inputs and outputs. However, in large-scale problems involving the 
prediction of many related tasks, learning independently additive models results in a loss of model interpretability, 
and can cause overfitting when training data is scarce. We introduce a novel multi-task learning approach which 
provides a corpus of accurate and interpretable additive models for a large number of related forecasting tasks. Our 
key idea is to share transfer functions across models in order to reduce the model complexity and ease the exploration 
of the corpus. We establish a connection with sparse dictionary learning and propose a new efficient fitting algorithm 
which alternates between sparse coding and transfer function updates. The former step is solved via an extension of 
Orthogonal Matching Pursuit, whose properties are analyzed using a novel recovery condition which extends existing 
results in the literature. The latter step is addressed using a traditional dictionary update rule. Experiments on real- 
world data demonstrate that our approach compares favorably to baseline methods while yielding an interpretable 
corpus of models, revealing structure among the individual tasks and being more robust when training data is scarce. 
Our framework therefore extends the well-known benefits of additive models to common regression settings possibly 
involving thousands of tasks. 


Index Terms 

Additive models, nonparametric regression, dictionary learning, sparse representations, multi-task learning. 

I. Introduction 

Additive models are a widely popular class of nonparametric regression models which have been extensively 
studied theoretically and successfully applied to a wide range of practical problems in signal processing and machine 
learning m, la, El. The key ingredient of additive models are transfer functions that explain the effect of covariates 
on the response variable in an additive manner. Besides being flexible (e.g., allowing for the modeling of nonlinear 
effects for both continuous and categorical covariates) and yielding good predictive performance, an important 
selling point of additive models is their interpretability. In particular, the transfer functions provide intuitive visual 
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means for application experts to understand the models and explore the relationship between input and output 
signals of the system under study. 

In many real-world data modeling settings, one faces the problem of forecasting a large number (e.g., several 
thousands) of related tasks. In this case, learning additive models independently for each task has several disad¬ 
vantages. Firstly, the number of models would be too large for a domain expert to visually inspect all the transfer 
functions, hence - in essence - the corpus of models loses its interpretability from a human point of view. Secondly, 
independently learning the models ignores structure and commonality among the tasks. Thirdly, when training data 
is scarce, learning the models independently is prone to overfitting the data. 

To overcome these challenges, we introduce a novel multi-task learning framework for additive models. Intuitively, 
the key idea is to share transfer functions across tasks that exhibit commonality in their relationships between input 
and output variables. More specifically, each individual task is modeled as a weighted sum of transfer functions 
chosen from a candidate set which is common to all tasks, and the cardinality of which is small relative to the 
total number of tasks. Our algorithm for solving the multi-task additive model learning problem uses an intrinsic 
connection with sparse dictionary learning 11, Q, 0. More specifically, we reformulate the fitting problem as a 
special form of dictionary learning with additional constraints; leveraging recent advances in the field, we propose 
a novel fitting approach that alternates between updates of the transfer functions and the weights that scale these 
functions. We introduce a novel algorithm for updating the coefficients that scale the transfer functions, called Block 
Constrained Orthogonal Matching Pursuit (BC-OMP), which extends conventional Orthogonal Matching Pursuit Q, 
0. Furthermore, we derive novel coherence conditions for the accurate recovery of the optimal solution which 
are interesting in their own right as they extend existing theory. Transfer functions, which correspond to dictionary 
elements in our dictionary learning analogy, are updated using a traditional dictionary step update M- 

In the experimental part of our paper, we apply the proposed algorithm to synthetic and real-world electricity 
demand data. Synthetic results show that the proposed approach accurately learns transfer functions from noisy 
data. In addition, the proposed method is shown to outperform baseline linear and non-linear regression methods 
in terms of prediction accuracy. In a second experiment, we use a dataset of 4,066 smart meter time series data 
from Ireland, and show that our approach yields predictive performance comparable to baseline methods while only 
using a small number of candidate functions; interestingly, the discovered commonality of tasks corresponds to 
classes of residential and different types of enterprise customers. When using only a small fraction of the training 
data, our approach yields more robust results than independent learning and hence inherits the benefits of traditional 
multi-task learning. In a final experiment, we apply our multi-task learning algorithm to a single-task problem to 
improve the prediction accuracy of traditional additive model learning, while maintaining the number of learned 
transfer functions small. 

Over the past decade, many works have shown the benefits of multi-task learning over independently learning the 
tasks ifTOl . ifTTIl . lfT2l . and different approaches to multi-task learning were considered. In ifTSl . the authors impose 
the linear weight vectors of different tasks to be close to each other. The work in M constrains the weight vectors 
to live in a low-dimensional subspace. Still in the context of linear models, the authors of ca assume that the 
tasks are clustered into groups, and that tasks within a group have similar weight vectors. In the context of additive 
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models, M proposes new families of nonparametric models that enforce the selected covariates to be the same 
across tasks. This setting is particularly relevant for regression tasks involving a large number of covariates p, and 
the algorithm in llThl extracts a common set of covariates for the tasks. Our work significantly differs from ifT^ in 
several aspects. While m enforces a common set of covariates across tasks, the transfer functions are different. 
In other words, their approach only leverages commonality with respect to which covariates affect the dependent 
variable, but not how they affect it, leading to a number of transfer functions that is still too large for inspection 
by human experts. By imposing a common set of candidate transfer functions across tasks, we limit the number 
of transfer functions, and obtain interpretable models even for problems involving thousands of tasks. Moreover, 
unlike Km, we consider a setting where all covariates are relevant for the task at hand. Hence, in this paper, we 
are typically interested in problems involving a small number of input covariates p, and a very large number of 
tasks N. Our approach shows that, by learning a number of transfer functions that is much smaller than pN, it is 
possible to achieve comparable or better performance than models involving a much larger number of parameters. 

The paper is structured as follow: Sec.|II]introduces notation and provides a review of additive models. In Sec.[nl| 
we formulate the multi-task additive model learning problem and establish the connection with sparse dictionary 


learning. The algorithm for solving the multi-task problem is explained in Sec. IV and the theoretical analysis of 
recovery conditions is provided in Sec.|V] In Sec.|Vl]we describe our experiments on synthetic and real-world data; 


conclusions and an outlook on future research are given in Sec. VII 


II. Preliminaries 

A. Notations 

We use boldface notation for vectors and matrices. Moreover, we use [n] to refer to the set n}. Given a 

vector z, we denote by |jz||o the £o “norm”, that counts the number of nonzero elements in z. Also, we denote by 
(g) the Kronecker product operation. If Z G a matrix, vec(Z) G denotes the vectorization of Z, 

obtained by stacking the columns of Z, and Z^ denotes the Moore-Penrose pseudo-inverse. Moreover, we use the 
notation z G Z to denote that z is one of the columns of Z. Finally, Z > 0 denotes the entry-wise non-negativity 
constraint. 


B. Additive models review 

We first briefly review additive models. Let {xij,i G [n],j G [p]} and {yi,i G [n]} denote respectively the 
observed covariates and response variable. Here, n is the number of observations and p the number of covariates. 
Additive models have the form: 

p 

Vi = d + + 

where // is the intercept and Ci is assumed to be a white noise process. The transfer functions fj represent the effect 
of a covariate on the response variable. The additive model is illustrated in Fig. To ensure unique identification 
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Fig. 1. Additive model diagram 


of the fj’s, we assume that transfer functions are centered: ~ ^ ^ W- Nonlinear transfer 

functions of continuous covariates are commonly modeled as smoothing splines 0, m, i.e.. 




( 1 ) 


where Pjt denotes the spline coefficients, (j)jt the B-spline basis functions, and Tj the number of basis splines. Using 
this representation, estimating the transfer functions therefore amounts to the estimation of the spline coefficients 
/3jt and the intercept /i. We consider the following htting problem with centering constraints: 

n / P Tj 

n '^j 

subject to X! X! 0 ^ W- 

One can convert the above problem to an unconstrained optimization problem by centering the response and basis 
functions. Specihcally, let ^ Sj(z) = [ 0 ^ 1 ( 2 ;) - ({>, 1 , ■■■, (j^jT^iz) - “d 

si(xii)^ ... Sp(a:ip)^ 


S = 




We dehne the vectorized spline coefficients (3 = 




^p{Xnp') 
1 T 


/3. 


with f3j = 


3ji ... (3jTj 


( 2 ) 


. The above 


constrained htting problem is then equivalent to the following unconstrained least squares problem: 

min||y-S^||^, 

where y denotes the centered response variables y = [yi — y, — y]^, with y = ^/nYYi Vi- order to 

avoid overhtting, a quadratic penalize!' is commonly added, leading to the problem: 

mm||y-S/3||2+/3^S/3, 

with a regularization matrix S. The penalized minimization problem has the closed form solution: 

^ = (S^S + S)-iS^y, 

provided that S^S + S is non-singular. 
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III. Multi-task additive model with shared transeer functions 

We now introduce our new multi-task additive model with shared transfer functions. We assume a A^-task 
regression problem where G [n],j G [p],w G [W]} are the covariates, and G [n],TO G [iV]} 

denotes the response variables, where the superscript (m) is the task index. We further assume without loss of 
generality that the response variables have zero mean. Our multi-task model is given as follows; 

p Li 


-f C 


(m) 


with 


ia'-’iic 


i=i 1=1 

< 1 and > 0 for all j G [p],m £ [A^]. 


(3) 


Note that, in our new model, the response variables are weighted linear combinations of p transfer functions, each 
of which is selected from the set Fj = {fji,l G [Lj]} which contains Lj candidate transfer functions that model the 
effects of the covariate j. The fg norm constraint on the weights A^™^ prevents two transfer functions from the set 
Fj to be active for the same task. Hence, only one transfer function captures the effect of a covariate in a response 
variable. This constraint is crucial, as it disallows the creation of “new” transfer functions from the candidate ones 
by linearly combining them. While the transfer functions fji are common to all the tasks, the non-negative weights 
are task-specific and permit to scale the transfer functions specifically for each task. This offers extra flexibility 
as a wide range of tasks can be modeled using the model in Eq. Q while keeping the number of (standardized) 
candidate transfer functions small. As we will see in Sec. VI the non-negativity constraint in Eq. ([^ facilitates the 
interpretation of the activation of the same transfer functions across different tasks as commonality; without this 
constraint, the same transfer functions could represent exactly opposite effects, e.g., higher temperatures leading 
to higher electricity demand for one task, and leading to lower demand for another one. Our multi-task model is 
illustrated in Eig. 


Similarly to what is done with single-task additive models (Sec. II-B i, we model transfer functions using 
smoothing splines. Specifically, we write: 


(4) 


where Sj and Pji denote the centered spline basis functions and coefficients, respectively. Using this representation, 
we rewrite the model in Eq. ([^ in the following vector form: 


Vm G [A^], 






j=l 


where = 


(.«) ■■■ -.(A?) 


T T 


, B, = 




ji 




, and is a Gaussian iid random 


vector with zero mean. The model htting then consists in finding admissible and {A^ „jG[iV] that 

minimize the sum of squared residuals, while avoiding oveffitting. We therefore write the problem as follows: 

2 


(P): 


mm 

(m) 




N 

E 




-Es 

3 = 1 


(™)b 

1 3 




subject to ||A^-^^||o < 1 and > 0 for all j, m, 
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Fig. 2. Multi-task additive model diagram. J^j denotes the set of Lj candidate transfer functions that model the effects of covariate j. The sets 
1 ^ J ^ P common to all tasks. Each task is modeled as a linear combination of transfer functions chosen from the sets ^i,... ,^p. 
In this diagram, the symbol ^ denotes an arbitrary index in [Lj]. 


where $7 is a regularization term that prevents model overfitting. Note that, unlike traditional additive model learning, 
the above problem has two types of unknowns, that is, weights and transfer functions. In this paper, we use the 
following regularization function 

(5) 

with the regularization matrix S = and z/ > 0, and b is the vector formed by concatenating vec(Bj), 1 < j < p. 
Note that this regularizer penalizes large coefficients of smoothing splines with the strength of this effect tuned by 
v. 

The fitting problem (P) is inherently related to sparse dictionary learning Q where the goal is to find the 
dictionary D and sparse codes C that minimize 


IIY — DC|||. subject to ||c||o < p for all c G C, 


with Y = G 


^nxN 


, and p is the desired level of sparsity. To simplify the exposition of the analogy, 

(m) 


let us consider the multi-response scenario where covariates are equal across tasks (xlj ' = Xij for all m). In this 
case, we have = Sj for all m G [N\. We define the subdictionaries (or blocks) T)j = S^Bj and the global 


dictionary D 


Di 


D, 


. The problem (P) can be rewritten as follows; 


||Y-DA||^ + f7({B,}^0i) 

subject to ||Aj"^^||o < 1 for all j, m and A > 0, 
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with 


aW ., 

,. A^ 


Af). 


A = 


Hence, the difference between sparse dictionary learning and problem (P) essentially lies in the underlying sparsity 
constraints: while in the former one the only constraint is that sparse codes have no more than p nonzero entries, 
in the latter they are further constrained to have at most one nonzero entry for each subdictionarj^ Based on this 
analogy, we introduce in the next section a novel algorithm for efficiently approximating the solution of problem 

(P). 


IV. Learning algorithm 

The problem of dictionary learning has proved challenging. In fact, even if the dictionary is known, it can be 
NP-hard to represent a vector as a linear combination of the columns in the dictionary ini. Problem (P) inherits 
the difficulty of dictionary learning, and we therefore propose an approximate algorithm that solves successively 
for the weights and spline coefficients {Bj}. The proposed algorithm can be seen as an extension of the 

popular MOD algorithm which alternates between sparse coding and dictionary updates. 


A. Weights update 


We assume that the spline coefficients matrices {Bj} are given, and we define S 


We define the columns of each subdictionary = 

l("i) ; 


d(™) 

S,1 


j(™) 


3 

to be the atoms of Hence, an 

(m) 


atom I ' is obtained by applying the transfer function fji to all the observations of the jth covariate; d^ j = 


/77O 


'ij ) 


f3li<7) 


The weight estimation problem is given by 


{A 


min 

(m) 


N 

, E 

Sjyrn rn — 1 


ri^) _ 


Ed 


m) , (m) 

3 7 


7 = 1 


subject to ||Aj™^||o < 1 and A^™^ > 0 for all j,m. 


The above problem can be seen as computing the best non-negative p-sparse approximations of the signals 
in the dictionary ... |Dp™^], provided that no two active dictionary atoms belong to the same 

subdictionary. We first note that this problem is separable and therefore can be solved independently for each task. 
Next, we simplify the problem and drop the non-negativity constraints on Following the approach used in 


^Note that there are a few other differences. For example, in dictionary learning, atoms are usually unconstrained unit-norm vectors, while 
in our model they ai‘e constrained to be linear combinations of B-splines. Moreover, our problem involves an additional regularization function. 
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ca, lua, non-negative coefficients can then be obtained in a post-processing step by including the negative of 
each atom in the dictionarj0 as we have; 

max(0, 

max(0, —A^"*^) 

For a single task, our weight estimation problem is written; 


Ed 


(m) , (m) 

j 


n 


D 


-D 


min 




subject to ll-^jllo < 1 for all j G [p]. 


To solve this problem, we propose the iterative algorithm Block Constrained Orthogonal Matching Pursuit (BC- 
OMP). It is an extension of the popular Orthogonal Matching Pursuit algorithm Q, lH which is an efficient greedy 
method for solving sparse coding problems. At each iteration of the algorithm, we select the dictionary atom which 
has the strongest correlation with the residual, provided it belongs to an available subdictionary whose index is 
listed in Aj-i- The residual is then updated using an orthogonal projection onto the selected atoms. The availability 
set Aj is in turn updated to prevent selecting two atoms from the same subdictionary. Details of our approach are 
presented in Algorithm 


B. Spline coefficients update 


We now solve the problem of learning the spline coefficients given the fixed weights A^'"^. We note that; 

= f^((A«)^®s5’"))vec(B,) 

i=i i=i 




vec(Bi) 

vec(Bp) 


A zMb. 


Thus, the objective function becomes; 


N 


||yM _ -f b^Sb = ||vec(Y) - Zb||^ -f b^Sb, 


m—1 


^This post-processing step results in doubling the number of candidate transfer functions. That is, for covariate j, we have 2Lj candidate 
transfer functions, as we consider the positive and negative versions of every (original) transfer function. Note also that this post-processing 
approach is an approximation and is therefore not guaranteed to give the exact solution to the original (constrained) problem, with 2Lj transfer 
functions for each covariate j. 
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Algorithm 1 BC-OMP 

Input: Subdictionaries Di,..., Dp, signal y. 
Output: Weight vectors Ai,..., Ap. 


Initialization: 

Available covariates: An ^ p}, residual: rp ^ y, selected atoms: Ug ^ 0, weight vectors: \j ^ 

0 for all j G [p\. 

for all j = 1,... ,p do 
Selection step: 

r; 7 1 l(D-i7dfc,z)| 

\ki,lA G- argmax argmax —^—n-■ 

keA,-i ie{i,...,L,} lldfe.zlb 


Update step: 


end for 

for all j = 1, ... do 

Set ^ Cp[j]. 

end for 


Cj G- U]y,rj 




Cj G- UW, rj^y — VjCj. 







yd) 


z(i) 

with vec(Y) = 

y(^) 

and Z = 

ZW 


. The minimum of the above least-squares program with respect to b is 


given by: 


b= (Z^Z + S)-^Z^vec(Y). 


(6) 


Given the spline basis coefficients Bj, the transfer functions can then be obtained by multiplying the obtained 
coefficients with the spline basis vectors (Eq. 0)- 


C. Complete learning algorithm 

The complete algorithm is shown in Algorithm Using a random initialization of the spline coefficients, we 
iterate through the weights update and spline coefficients update steps, until a termination criterion is met. In this 
paper, we terminate the algorithm after a fixed number of iterations. In a final step, the matrices Bj and A^"*^ are 
modified to ensure the non-negativity of the weights, as discussed in Section |IV-A| 

We briefly analyze the computational complexity of Algorithm We assume for simplicity that Lj = L for 
all j G [p]. The BC-OMP algorithm involves a selection and an update step whose complexity are 0{Lpn) and 
0{pn+p'^) respectively. Assuming that p <n, the selection step dominates and the overall complexity of BC-OMP 
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Algorithm 2 Multi-task additive model algorithm 

Input: Covariates response variables parameters Li,... ,Lp and v. 

Output: Spline coefficients scaling weights „j- 


Initialize Bi,..., Bp with random entries from Af{0, 1). 
while not converged do 

Weights update: Use BC-OMP for each response variable to estimate {A^™^}jg[p] 
Spline coefficients update: Use Eq. (|^ to update the spline coefficients. 

end while 

Ensure the non-negativity of the weights: 


for all j,m. 


B, 



Bj Bj , 

max(0, Aj™^) 
max(0, — A^™^) 


is 0{Lp^n). Doing this operation for each task results in a complexity of 0{nNLp^). The spline coefficients 
update involves the computation in Eq. 0. To compute the complexity of this operation, note that Z has nN rows 
and LTp columns (where T is the number of spline basis functions, or equivalently the number of columns of 
assumed to be equal for all covariates j for simplicity). Eor typical problems, this matrix is tall and the 
complexity is driven by the computation of Z^Z, which is of complexity 0{nN{LTp)'^). Hence, assuming that 
we run Algorithm for a fixed number of iterations, the complexity of our overall algorithm is 0{nN{LTp)'^). 

Our algorithm is therefore linear in the number of tasks N, and dimension n, while being quadratic with respect 
to the number of candidate transfer functions per task L. In comparison, learning an additive model independently 
for each task has complexity 0{nN{Tp)^). Compared to the independent additive model approach, the price to 
pay of our algorithm is therefore which remains small in most problems of interest. 

V. Recovery condition eor BC-OMP 

We analyze in this section the weights update algorithm BC-OMP. While BC-OMP represents one building block 
of the global algorithm (Algorithm]^, an analysis of the recovery conditions of BC-OMP is important as it provides 
insights onto the success of our algorithm. In addition, our recovery condition also applies to OMP and is interesting 
in its own right. 

We suppose that y is a superposition of p elements in D = [Di| ... |Dp] such that no two active elements belong 
to the same subdictionary Dj, i.e., y = - simplicity, we further assume that the atoms dj^i- are 

linearly independent and the 7 j are all nonzercj^ We develop a sufficient condition for the recovery of the correct 

^Otherwise, the signal has a representation with fewer atoms and one can remove the unused covariates j S [p]. 
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atoms using BC-OMP. 

We first note that the difference between OMP and BC-OMP algorithms lies in their search space: while OMP 
selects atoms from the dictionary D having maximal inner product with the residual, BC-OMP further imposes a 
constraint that the selected atom belong to an available subdictionary where no atoms have been previously selected. 
It follows that if OMP succeeds in the recovery of the correct atoms of y, the same holds for BC-OMP. Therefore, 
any condition that guarantees the recovery of OMP is a fortiori a recovery condition for BC-OMP. Many OMP 
recovery conditions have been proposed in the literature (see e.g., 1201, ED). The following theorem in l20l gives 
a popular and practical recovery condition of OMP for the global dictionary D: 


Theorem 1 ( 1201 '). Let 


A 

H = 


max I (d, d ) 

d.d'GD 

d^id' 


OMP recovers every superposition of p atoms from D whenever the following condition is satisfied: 

P<\ + 1) • 


(7) 


The quantity p, called coherence, measures the similarity between dictionary atoms. The values of p that are close 
to 1 may violate the recovery condition in Eq. thus leaving us without any guarantee that OMP or BC-OMP 
will recover the correct atoms. Unfortunately enough, in our multi-task learning framework, p is typically close to 
1. To see this, note that the inner product between two atoms of the same subdictionary is equal to (d_, /,dj i/) = 
Sr=i practice, transfer functions in the same subdictionary often bear strong resemblance, e.g., 

for examples). Thus, fji « fjii and | (dj_;,dj i') | « 1, which leads to a 


similar monotic behavior (see Sec. 


VI 


large coherence value. On the other hand, the inner product of atoms from different subdictionaries | (dj d^/ //) | = 
X]r=i is close to zero when the covariates j and j' are “sufficiently independent”|^To circumvent 

the above violation of the recovery condition in Theorem [D due to the large global coherence of the dictionary, we 
first define coherence within and across subdictionaries: 


Definition 1. 


Mantra = max max |(djdj i-)|, 

iG[p] (i,r)G[ij] 


fiinter= Hiax max . 

Using these definitions, we derive the following recovery condition for BC-OMP; 


Theorem 2 (Recovery condition). If the following condition holds: 


l^intra 1 ; 


^More precisely, if we model the covaiiates as random variables Xj and Xj/, then I/tt. seen as a sample 

estimate of the population covariance K[fji{Xj)fj/i/{Xj/)] which will be 0 if Xj and Xj/ are independent. Note that in this argument we 
use the fact that the transfer functions are centered (see Sec.[n|. 
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then BC-OMP recovers the correct atoms and their coefficients. 

The detailed proof can be found in the Appendix. Unlike the recovery condition in Theorem [T] Theorem 
does not depend on the global coherence of the dictionary. This recovery condition is particularly interesting in our 
applications where ptmtra typically takes large values due to strong resemblance between the transfer functions in the 
same subdictionary, while Winter is small when the covariates are sufficiently statistically independent. Interestingly, 
Theorem shows that, in the special case where subdictionaries are orthogonal to each other, the parameter Sintra 
can take values arbitrarily close (but not equal) to 1 and BC-OMP still succeeds in the recovery. In contrast, the 
recovery condition of Theorem [T] is not satisfied in this case since the global coherence p is close to 1. Note that 
our recovery condition is not limited to BC-OMP but also valid for the OMP algorithm, hence it extends existing 
sparse representation theory for an interesting class of sparsity structure. 

Finally, we draw the reader’s attention to some results related to the proposed recovery condition in Theorem 
In II 22 I . the authors provide a new analysis of Matching Pursuit, when the dictionary is built from an incoherent 
union of possibly coherent subdictionaries. A sufficient condition that guarantees the selection of atoms from the 
correct subdictionaries is shown. In other words, the exact recovery of atoms is dropped and a sufficient condition 
for the weaker subdictionary recovery property is shown. This is completely different from our setting, where we 
require the correct atoms to be recovered when the signal contains at most one atom per subdictionary. In ||23, 
the “block sparse” model is introduced; the signals’ non-zero entries appear in blocks rather than being spread 
throughout the vector. Coherence-based recovery conditions for a block version of OMP are shown. Once again, 
our model significantly differs from this one, as we assume one active component per subdictionary (or block), 
whereas the work of ll23l assumes that nonzero entries occur in clusters. 


VI. Experimental results 

In this section, we present experimental results. Baseline methods and practical implementation details of our 
algorithm are explained in Sec. VTA Sec. |VI-B reports results on synthetic data, and the following two sections 
show results on electric load forecasting problems. More background on using additive models for electricity demand 
forecasting can be found in ||24l, 1^ for example. 


A. Experimental setup 

We compare the proposed multi-task learning approach to the following baseline regression methods: 

1) Linear Regression (LR): A linear regressor is learned independently using e-SVR Ehll with a linear kernel 
for each task. The penalty parameter C is set using a cross-validation procedure for each task. We use the 
Liblinear implementation ETIl . 

2) Support Vector Regression with Radial Basis Function kernel (SVR-RBF): We leain an e-SVR-RBF 
regressor independently for each task where the penalty parameter C and kernel bandwidth a are determined 
using cross-validation. We use the LibSVM implementation ll28l . 

3) Independent Additive Models (lAM): An additive model is fitted for each task independently using the 
mgcv package in R E). 
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(a) (b) (c) 


Fig. 3. Estimated transfer functions for the synthetic experiment, using the proposed and lAM methods. The black dots indicate the noisy 
observations used to estimate the transfer functions, and the blue dashed lines represent the true transfer functions. The results are shown for 
a fixed covariate, leading to the L = 3 estimated transfer functions that are illustrated. For the lAM method, the depicted transfer functions 
correspond to estimations obtained on arbitrary tasks involving the true transfer functions. 


4) K-Means and Additive Model (KAM): This is a two-step approach, where in the first step we use the 
K-means algorithm to group the set of tasks into different clusters. In the second step, one additive model is 
learned independently for each cluster centroid. The prediction of a signal is given by the prediction for the 
centroid of the cluster it belongs to. 

We now discuss practical aspects of our algorithm. For simplicity, we have chosen the regularization value v = \ 
in all our experiments (see Eq. |^. Note that a cross-validation procedure is likely to give better results, but is more 
computationally expensive. Moreover, in all experiments we set the parameters Lj = L for all covariates j G [p]. We 
envision that, in real-world applications, the parameters Lj will be manually selected by domain experts (possibly 
in an iterative procedure) to find the optimal trade-off between predictive performance and model interpretability 
from their point of view. In the experiments below, we show results for different values of L to evaluate how the 
choice of Lj affects the predictive performance. Finally, similarly to K-means, our proposed algorithm can incur 
the problem of “empty clusters” when the number L of transfer functions per covariate is large. We circumvent 
this problem by checking at each iteration for unused transfer functions and, when such a transfer function is 
detected, replacing it with the transfer function that leads to minimum error for the task with the currently highest 
approximation error. 

B. Synthetic experiment 

In our first experiment, we generate n = 100 samples according to the multi-task additive model in Eq. Q, with 
p = 10 covariates, N = 200 tasks and Li = ■■■ = Lp = L = 3 candidate transfer functions per covariate. For 
simplicity, we take the covariates to be equal for all tasks (i.e., = Xij for all m G [W]), and randomly 

sample Xij from the uniform distribution in [—1,1]. In this synthetic example, the ground truth transfer functions fji 
are randomly generated smooth functions, and the scaling weights are chosen to be non-negative random numbers. 
Finally, the model noise is iid, and follows the standard normal distribution. 
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Fig. 4. RMSE on test data of the proposed method, and the competing methods for the synthetic experiment. The experiments are performed 
on 50 independent trials. 


We first assess the quality of the estimated transfer functions using our algorithm, and compare it to the ground 
truth transfer functions (known in this synthetic setting), as well as the functions estimated with lAM method 
(which treats each task independently). For a hxed covariate, we show in Fig. Q (a-c) the L = 3 associated transfer 
functions, as well as the proposed and lAM estimations. Clearly, the estimation of the true transfer functions using 
our multi-task approach is much more accurate and resilient to noise than lAM. In fact, the true and estimated 
functions using our approach nearly coincide, despite the fact that observations are highly noisy, and the relatively 
low sample size. We then compare the prediction performance (in terms of Root Mean Squared Error - RMSE) 
of the proposed method to other competitor methods on a test set of 400 samples. The results are illustrated in 
Fig- 1^ The proposed approach leads to signihcantly lower RMSE compared to other approaches on this synthetic 
example. Despite the limited number of training data, our approach correctly detects and leverages the correlation 
between the different tasks to signihcantly improve the results. 

C. Modeling of smart meter data 

In this experiment we use data from a smart metering trial of the Irish Commission for Energy Regulation (CER) 
1291 . The data set contains half-hourly electricity consumption data from July 14, 2009 to December 31, 2010 for 
approx. 5, 000 residential (RES) and small-to-medium enterprise (SME) customers. It comes with survey information 
about different demographic and socio-economic indicator, e.g., number of people living in the household, type of 
appliances, and business opening times. In our experiment, we only use customers that do not have any missing 
consumption data, leaving us with a total of = 4, 066 meters out of which 3,639 are residential and 427 SME 
customers. Since half-hourly smart meter data is very volatile due to the stochastic nature of electricity consumption 
at the individual household level, we aggregate each signal over 6 time points to obtain one measurement every 3 
hours. We split the data into 12 months of training and 6 months of test data. We consider a simple additive model 
with “Hour Of Day”, “Time Of Year” and “Day of Week” as covariates. 

Table |I] shows the average RMSE on the training and test data over all = 4,066 meters. Here, we use L = 5 
for the number of candidate transfer functions per covariate in our approach, and the number of clusters in the KAM 
method. In terms of predictive performance, our proposed approach clearly outperforms LR and KAM; it performs 
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TABLE I 

Average RMSE over all 4,066 tasks in the CER data set, and model complexity. Eor the proposed and KAM method, 
THE results with L = 5 ARE SHOWN. FOR EASIER COMPARISON, THE FOURTH COLUMN SHOWS THE normalized MODEL COMPLEXITY, 
I.E., THE MODEL COMPLEXITY DIVIDED BY Np WHERE N IS THE NUMBER OF TASKS AND p THE NUMBER OF COVARIATES. T IS THE 
NUMBER OF ELEMENTS IN THE SPLINE BASIS (EQUAL FOR ALL COVARIATES, FOR SIMPLICITY). 


Method 

RMSE 

Model complexity 


Training 

Testing 

Theoretical 

Numerical example 

Proposed 

2.6 

2.7 

p{TL + 2N) 

2.01 

LR 

3.1 

3.1 

Np 

1 

SVR-RBF 

2.2 

2.5 

Nn{p + 1) 

> 3800 

lAM 

2.3 

2.6 

pTN 

12 

KAM 

3.1 

3.2 

pTL 

0.01 


only slight worse than lAM albeit using only L = 5 different transfer functions per covariate while lAM learns 
independently one additive model per signal. Note that the methods based on additive models are competitive with 
SVR-RBF, while the latter approach is computationally expensive at training and test time, which makes it only 
moderately suitable for large-scale problems, besides leading to models that are unfortunately difficult to interpret. 

In an attempt to quantitatively compare the interpretability of the different methods. Table |I] shows the model 
complexity of the different methods learned on the CER data, defined as the number of scalar variables needed to 
store the modeQ For easier numerical comparison we divide the numbers by Np, i.e., the number of tasks times 
the number of covariates. As it can be seen, the SVR-RBF is the most complex model, since it depends on the 
number of observations n. lAM also has a high complexity as it fits one additive model per task. On the other 
hand, our proposed approach only needs pTL variables for the transfer functions, and 2Np values to store the 
weights In the experiment on the CER data, this results in a complexity of 2.01, where 0.01 results from 

the representation of the transfer functions, and 2 from storing the weight coefficients for each task. We conclude 
that our proposed approach finds a good trade-off between predictive performance and model complexity. 

Fig. 1^ shows the predictive performance of the proposed algorithm and the KAM method for different values of 
L. One can see that, already for values of L > 2, our method reaches an accuracy that is close to the performance 
of independently learned additive models. The performance of KAM is consistently worse, regardless of the number 
of clusters. 

Fig. 0 (a)-(c) display the transfer functions obtained by our method for L = 2, and Fig. (d) shows the 
corresponding matrix of correlations |D^D| between the atoms of the dictionary estimated using the proposed 

^For the SVR-RBF method, it is assumed for simplicity that the number of support vectors is equal to the number of training points n. 
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Fig. 5. Average RMSE on the CER data set vs. number of candidate functions (per covariate) and clusters L, respectively, for the proposed 
and the KAM method. 



Fig. 6. Transfer functions obtained with L = 2 (a-c), and congelation between atoms (d). (a): Hour of Day, (b): Time of Year (0: January, 1st, 
1: December, 31st), (c): Day of Week. 


algorithnj^ As noted in Section this matrix has a block diagonal structure due to similarites between transfer 
functions depending on the same covariate on the one hand, and independence of different covariates on the other 
hand. We obtain the coherence values fiintra = 0.87 and fimter = 0.01 (see Definition [T]l, satisfying our recovery 
condition for BC-OMP in Theorem while the original condition in Theorem for recovery in OMP is clearly 
not satisfied. 

Let us consider the interpretability of the transfer functions learned using our method, and study correspondances 
with the customer survey information in the CER data set. Table [I^ relates the activation of the “Hour of Day” 
transfer functions to the customer type (residential vs. SME). In this experiment we chose L = 2, resulting in 4 


“final” transfer functions due to the non-negativity trick discussed in Sec. IV-A One can see that, for residential 
customers, overwhelmingly the first transfer function is activated, while the majority of SME signals is modeled 
using the second one. Looking at the shape of the transfer functions, this intuitively makes sense: the consumption 
of residential customers typically peaks in the evening, while SMEs consume most electricity during the day. 
Similarly, Table m shows the correspondence between the activation of the “Day of Week” transfer function, and 
the SME business days (which is available from the CER survey information). Again, there is an intuitive and 
easy-to-interpret correspondance between the learned models and available ground truth information. 


®In this experiment, the observed covariates are equai for aii the tasks (i.e., = ^ij), since Hour of day. Time of year and Day of week 

are cieariy independent of the task at hand. This ieads to a unique dictionary D that is independent of the task. See Sec. E for further details. 
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TABLE II 

Percentage oe the activation of “Hour of Day” transfer functions for residential and SME customers. 



TABLE III 

Percentage of the activation of “Day of Week” transfer functions for SMEs with different business days (see the 

LEFT column). 









1 1 I 1 I 



1 I 1 I . 






Mil* 


1 1 1 i ‘ 

Mon Wed Fri 

Day of week 


Mon Wed Fri 

Day of week 


Mon Wed Fri 

Day of week 


Mon Wed Fri 

Day of week 


Week days only 

1% 

2% 

93% 

4% 

Week days + Saturday 

0% 

0% 

37% 

63% 

All days 

36% 

5% 

19% 

40% 


Finally, we evaluate the performance of our method in a setting where training data is scarce. For this purpose, 
we consider now a training set of n samples that are randomly selected from the CER data, and consider the 
remaining data for testing. Fig. |7] illustrates the testing RMSE of the proposed method (with L = 2) and the other 
competing methods with respect to n. It can be seen that when the training data is scarce, the proposed method 
outperforms lAM, and our method inherits the advantages of traditional multi-task learning by sharing information 
across tasks, and hence avoids overfitting. Note that for larger n, the gap between the two methods decreases, and 
lAM slightly outperforms our approach (with L = 2), as it provides a much more flexible model, which however 
suffers from lack of interpretability. Note finally that our approach consistently outperforms all other competing 
methods (LR, SVR-RBE, KAM) in the range of training samples in Eig. 

D. Intra-signal multi-task learning 

In this last set of experiments, we consider another application of our multi-task learning framework. A common 
approach in hourly electrical load forecasting is to treat each hourly period separately and use different models for 
each hour of the day (see, e.g., Il25ll and 1301 ). Besides the computational burden, such approaches unfortunately fail 
to discover intra-daily commonalities in the electricity consumption during different hours. Moreover, the resulting 
models are difficult to interpret. 

We address those issues using the proposed multi-task framework. Given a signal y G K" representing hourly 
electrical loads, we first reshape the signal into a matrix Y G where the d rows represent the days, and the 
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Fig. 7. Testing RMSE versus number of training samples for the CER experiment. 

TABEE IV 

Training and testing RMSE for the GEECom task. 


Method 

RMSE training 

RMSE testing 

Proposed (L = 4) 

0.11 

0.17 

LR 

0.62 

0.50 

SVR-RBE 

0.11 

0.18 

Additive Model (AM) 

0.13 

0.19 

lAM 

0.11 

0.18 


columns the hours in a day. We then treat the columns of Y as separate tasks, and fit our model using the proposed 
algorithm. For this experiment, we use 4.5 years of data from the GEFCom 2012 load forecasting challenge ED, 
considering “Time Of Year”, “Day of Week” and “Temperature” as covariates. The response variable is set to be 
the sum of the 20 zonal level series expressed in gigawatt. Moreover, the temperature covariates are obtained by 
computing the average signals over the 11 weather stations provided in the data. We use the first d = 1,642 days 
in the data set. The first 4 years of the data are considered for training, and the remaining for testing. 

We compare the proposed approach to single-task regression methods that do not split the signal into hourly 
signals. Specifically, considering an additional “Hour of Day” covariate, we fit linear and nonlinear models using 
LR, SVR-RBF as well as an Additive Model that reads 


Ui = /i(Hour of Day^) + / 2 (Time of Yeari) 
+ /a(Day of Week.j) + / 4 (Temperaturej). 


In addition, we compare the proposed approach to lAM using the same split into hourly signals. 


Table IV shows the result of the comparison. It can be seen that, with L = 4, the proposed approach outperforms 
all other methods in terms of testing RMSE. By splitting the signal into hourly signals, our algorithm yields a testing 
RMSE that has improved roughly by 10% with respect to AM. In addition, our approach slightly outperforms lAM 
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(a) 


(b) 


Fig. 8. Results on the GEFCom data with L = 4. (a) Atemp matrix showing the activation of the temperature transfer functions for the 24 
hours per day. (b) Shapes of the estimated temperature transfer functions. 


in terms of testing accuracy in this experiment, while learning much less transfer functions. Interestingly, our 
algorithm yields a clustering of the hours of the day with some intuitive interpretations: consider the matrix Atemp 
in Fig. 1^ (a) which shows the assignment of the L = A temperature transfer functions (displayed in (b)) to the 24 
signals representing different hours per day. To be more specific, the matrix Atemp is given by 


Atemp — 


(1) 

X(24) 

■ ■ 

'jl 

(1) 

X(24) 

■ ■ ^24 

>j4 • 


where j corresponds to the index of the “Temperature” covariate. Note that the transitions are “smooth”, i.e., 
consecutive hours are typically modeled using the same temperature transfer functions, albeit we did not explicitely 
enforce this property. While all the transfer functions in Fig. (b) have a similar V-shape, there are noteworthy 
differences. For example, TF4 compared to TFl leads to higher load predictions for hot temperatures and lower load 
predictions for cold temperatures. Intuitively, we can interpret TF4 and TFl as representing “air conditioning” and 
“heating” effects, respectively. This corresponds well with the hours for which these two functions are activated: 
TF4 during the day where most air conditioning occurs, and TFl during the night and early morning, where most 
electricity is used for heating. 


VII. Conclusion 

In this paper, we introduced a novel multi-task learning framework for additive models with the key idea to 
share transfer functions across the different tasks. We established a connection between the proposed model and 
sparse dictionary learning and leveraged it to derive an efficient fitting algorithm. We further conducted a theoretical 
analysis of the recovery conditions of the sparse representation step; by distinguishing between coherence within 
and across different subdictionaries, we were able to establish recovery for a wider range of realistic settings that are 
particularly relevant in our multi-task learning problem. Through synthetic experiments, we showed that the proposed 
algorithm correctly estimates the underlying transfer functions, and outperforms competing methods in terms of 
predictive power. In experiments with real-world electricity demand data, we demonstrated that our proposed multi¬ 
task approach achieves competitive performance with baseline methods that learn models independently for each 
task, while providing models that are more interpretable, extracting inherent structure in the tasks (e.g., clustering 
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of tasks corresponding to different customer types), and being more robust in settings where training data are scarce. 
In future work, we plan to improve the scalability of the method to apply it to domains that potentially involve 
millions of tasks. 


Appendix 

A. Proof of Theorem 

Assume that y = prove by induction that the correct atoms are recovered when the 

sufficient condition holds. Assume that, after j G {0,...,p — 1} steps, BC-OMP has recovered correct atoms in 
the support. Therefore, it holds that the residual signal rj G span(di ..., dpand we write 

p 

s=i 

Since y is exactly p-sparse, the residual is non-zero, and ex f 0. The atom selected by BC-OMP at step j + 1 is 
optimal if and only if: 


max max |(r,,dfc_/)| < max |(r,-,dfc 4)] . (8) 

keAj ie[Lk] keAj ’ 

if Ik 


We establish the recovery condition by showing a lower bound to the right hand side and an upper bound to the 
left hand side of Eq. ([^. Note that, for any k G [p]: 


l(rj.dfe,4)| = 


9=1 


— <Xk + Ctg (dg,/g ; dfc.ifc ) 

^ |ct/c| ~ ^ ' |llgl I (dg,Zg; ^k,lk ) I ■ 

g¥=k 

Moreover, by definition, |<^dg < fj,mter for 9 k. It follows that the right hand side of Eq. ^ can be 

bounded as follows: 


max|(rj,dfc,i,)| 


(a) 


max|(rj,dfe,4)| 

feG[p] 


> max 
fcG[p] 


Ffcl 


Pinter ^ ^ | 
gAk 


(b) 

> Hall 


(p l)pmier||a|| CXD 5 


where (a) is due to the fact that atoms that are not in Aj have already been selected, and are therefore 

orthogonal to Vj. Inequality (b) is obtained by bounding each term \ag\ by ||q:||oo- 
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We now exhibit an upper bound to the left hand side term of Eq. We have: 


max max |(r,-,dfc;)| 

k&Ajl(^[Lk]\ 


< max max I (r,-, i) I 

lAik 


= max max 

/cG[p] 

l^lk 


Oik ^ Ug (dg^ig, dfc_i) 

g^k 


< max loffel 

fcG[p] 


Sintra “t“ fainter 




9#fc 


II ^ II oo “t“ (p l)/^znier || ^ || cxd ■ 


Therefore, we obtain the following condition for Eq. to hold; 


l^intra |1 cxd C50 


^ ll^^lloo {p l)/^inier II ^ IIC30 ■ 


Since a 7 ^ 0, we simplify the condition to: 

pintra T Pinter ^ 1- 

As the above condition holds by assumption, we conclude that Eq. ([^ is satisfied and BC-OMP selects a correct 
atom at step j + 1 . 

Once the correct support is recovered, it is straightforward to see that an orthogonal projection onto the span of 
the recovered atoms yields the correct coefficients. Indeed, if we have y = linear independence 

of the atoms {dg imposes 7 ' = 7 ^. This concludes the proof. □ 
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